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Abstract 

The  high  performance,  low  cost,  and  flexibility  of  com¬ 
modity  hardware  systems  make  them  appealing  for  net¬ 
work  processing  applications.  However,  the  standard 
software  architecture  of  such  systems  imposes  signifi¬ 
cant  limitations.  At  high  rates  (e.g..  Gigabit  Ethernet) 
and  small  frame  sizes  (64  byte)  each  frame  must  be  pro¬ 
cessed  in  less  than  672  ns.  System  calls,  synchronization, 
and  memory  latencies  can  dominate  this  processing  time. 
Despite  significant  effort  to  remove  this  overhead,  we  are 
aware  of  no  general  purpose  mechanism  that  can  handle 
this  load  on  commodity  hardware. 

This  paper  describes  the  frame-shared-memory  archi¬ 
tecture  (FShm),  a  general-purpose  software  architecture 
for  processing  network  frames  on  commodity  multi¬ 
processor  hardware.  FShm  supports  kernel-  and  user- 
space  processing  at  Gigabit  Ethernet  rates  by  increas¬ 
ing  throughput,  without  reducing  the  per-frame  process¬ 
ing  time,  by  pipelining  work  across  multiple  processors. 
FShm  can  generate,  capture,  and  forward  frames  at  the 
theoretical  maximum  rate  on  a  Gigabit  Ethernet  network 
for  all  frame  sizes  greater  than  96  bytes,  and  at  95%  of 
maximum  for  the  64  byte  minimum  frame  size  (the  limit 
of  the  tested  hardware). 

1  Introduction 

Commodity  hardware  systems  are  an  appealing  choice 
for  network  processing  applications  that  demand  flexi¬ 
bility  and  low  cost.  Unfortunately,  the  software  archi¬ 
tecture  of  these  systems  imposes  a  number  of  limitations 
on  the  types  of  services  that  they  can  provide,  especially 
for  frame  processing  tasks.  For  example,  consider  a  Net¬ 
work  Intrusion  Detection  (NID)  system  designed  to  iden¬ 
tify  malicious  traffic  on  a  network.  Since  malicious  traf¬ 
fic  may  occur  at  the  Ethernet  frame  level  in  any  frame, 
such  a  system  must  be  able  to  handle  data  at  the  frame 
level,  for  all  valid  frame  sizes,  at  the  wire  rate. 


For  larger  frame-sizes,  there  are  a  number  of  software 
techniques  to  process  data  arriving  over  a  very  high- 
bandwidth  link  [5,  12,  18],  such  as  a  Gigabit  Ethernet. 
Unfortunately,  as  the  frame  size  decreases,  the  total  time 
available  to  process  each  frame  also  decreases.  For  ex¬ 
ample,  for  a  Gigabit  Ethernet  link  saturated  with  64  byte 
frames  (the  minimum  frame  size),  one  has  only  672ns  to 
process  a  frame  [24].  However,  on  an  AMD  Opteron  or 
Intel  Xeon,  the  overhead  for  a  single  system  call  (exclud¬ 
ing  any  real  work  done  by  the  OS)  ranges  from  171ns 
to  534ns;  a  single  lock-unlock  pair  may  cost  as  much 
as  354ns.  These  overheads  are  a  significant  fraction  of 
the  total  frame  processing  time.  While  some  applications 
can  cope  by  dropping  frames,  in  other  applications  such 
as  NID,  a  frame  drop  may  be  considered  a  definitive  ap¬ 
plication  failure,  since  the  dropped  frame  may  contain  an 
attack  which  would  be  missed  by  the  NID  system. 

Modern  high-bandwidth  processing  techniques  elim¬ 
inate  almost  all  frame-processing  overhead,  either  by 
avoiding  locks  and  system  calls  altogether,  or  by  aggre¬ 
gating  data  transfers  to  amortize  the  cost  of  these  over¬ 
heads  across  many  frames.  Despite  such  effort,  no  exis¬ 
tent  frame  processing  technique  has  been  able  to  demon¬ 
strate  the  ability  to  process  large  numbers  of  small  frames 
at  Gigabit  Ethernet  rates  on  commodity  hardware. 

This  paper  presents  the  Frame-Shared-Memory 
(FShm)  software  architecture  for  high-rate  frame 
processing.  FShm  allows  for  processing  of  small 
frames  at  Gigabit  Ethernet  rates  in  user-space  with  no 
dropped  frames.  FShm  accomplishes  this  by  using 
multiple  processors  to  build  a  processing  pipeline  that 
improves  throughput,  but  does  not  decrease  the  available 
per-frame  processing  time  for  applications.  In  fact,  by 
utilizing  additional  processors,  FShm  may  increase  the 
per-frame  processing  time  for  applications.  This  allows 
FShm  to  scale  to  faster  future  networks  by  using  the 
extra  cores  that  future  commodity  multi -core  processors 
will  provide.  This  paper  describes  experiments  that 
demonstrate  the  transmission,  reception,  and  forwarding 
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of  Gigabit  Ethernet  frames  at  the  theoretical  maximum 
with  no  drops  for  frames  96B  and  larger,  and  at  at 
95%  of  the  theoretical  maximum  wire-rate  (the  most 
the  hardware  can  generate)  with  no  drops  for  64B 
frames.  Furthermore,  we  demonstrate  that  FShm  allows 
user-space  threads  to  perform  sophisticated  processing. 

The  remainder  of  this  paper  is  organized  as  follows. 
Section  2  precisely  describes  FShm’s  design  goals  and 
then  discusses  system  constraints  that  make  achieving 
these  goals  difficult.  Section  3  then  presents  the  design  of 
the  FShm  pipeline  and  discusses  how  FShm  overcomes 
each  of  the  bottlenecks  in  Section  2.  Section  4  evaluates 
how  well  FShm  meets  its  design  goals  by  measuring  its 
performance  at  Gigabit  rates  at  a  variety  of  frame  sizes 
for  a  range  of  tasks.  Section  5  presents  related  work  not 
covered  elsewhere  in  the  paper.  Section  6  concludes. 

2  Design  Criteria  and  Constraints 

This  section  defines  the  the  design  goals  for  FShm  and 
discusses  the  bottlenecks  present  in  commodity  systems. 

2.1  Desired  Features 

FShm  was  designed  to  satisfy  the  following  two  criteria: 

1.  FShm  must  provide  the  ability  to  transmit,  receive, 
and  forward  packets  at  close  to  the  wire-rate  (at  Gi¬ 
gabit  Ethernet  speeds,  for  the  purposes  of  this  pa¬ 
per)  without  increasing  the  drop  rate  imposed  by  the 
underlying  hardware  (ideally  dropping  no  frames). 

2.  FShm  must  have  overhead  that  is  small  enough  so 
real  work  can  be  accomplished  by  the  application 
when  dealing  with  a  network  saturated  with  small 
frames.  A  system  that  consumes  all  per-frame  pro¬ 
cessing  time  simply  to  deliver  data  to  and  send  data 
from  an  application  is  nearly  useless. 

In  addition  to  high  performance,  a  general-purpose 
framework  is  desirable.  Exactly  what  is  considered  gen¬ 
eral  purpose  is  clearly  influenced  by  the  targeted  applica¬ 
tion.  However,  certain  properties  are  required  by  a  large 
number  of  network  applications,  and  FShm  has  been  de¬ 
signed  to  support  these.  The  additional  criteria  imposed 
by  the  desire  for  a  general  purpose  system  are  as  follows: 

3.  FShm  must  be  able  to  process  every  frame  on  the 
incoming  network  interface.  To  understand  this  re¬ 
quirement,  consider  applications  such  as  network 
intrusion  detection  (NID).  Since  any  frame  may 
contain  an  attack,  dropped  frames  may  result  in 
missed  attacks.  If  a  system  has  trouble  at  high-rates, 
attackers  can  evade  detection  simply  by  sending  a 
flood  of  small  frames.  It  is  easy  to  imagine  addi¬ 
tional  applications  that  might  monitor  every  frame. 


4.  FShm  should  support  both  in-order  and  out-of-order 
frame  delivery.  Consider  a  router  application.  This 
application  will  likely  want  to  process  frames  out 
of  order,  to  boost  performance.  However,  a  NID 
system  may  require  in-order  delivery  of  frames. 

5.  FShm  should  support  user- space  processing  of 
frames  since  complex  applications,  such  as  NID 
or  content-based  routing,  may  involve  sophisticated 
user-specified  code  written  in  high-level  languages 
with  an  eye  towards  portability. 

6.  FShm  should  provide  a  mechanism  to  trade-off  la¬ 
tency  for  expanded  processing  time  without  sacri¬ 
ficing  throughput,  since  some  applications  require 
fairly  complex  operations  on  a  per-frame  basis. 


2.2  Network  Imposed  Constraints 

The  first  constraint  faced  by  any  frame  processing  system 
targeting  wire-rate  processing  with  no  drops  is  the  frame 
arrival  rate  (or  the  inter-frame  arrival  period).  Consider 
a  saturated  Gigabit  Ethernet  link.  On  such  a  link,  the 
inter-frame  arrival  period  is  given  by 


T  = 


InterframeGap  +  Pre&Postamble  +  FrameSize 
Bitrate 


where  the  InterframeGap  >  96  bits, 

Pre&Postamble  =  64  bits,  and  FrameSize  is  the 
frame  size  measured  in  bits  (including  the  32  bits  of 
CRC).  Setting  the  Bitrate  =  1  x  109  bps  and  the 
InterframeGap  =  96  bits,  we  can  readily  find  the 
maximum  frame  rate  at  different  frame  sizes.  For 
example,  with  64B  frames,  there  are  1,488,095  frames 
per  second  (fps)  implying  the  system  needs  to  be  ready 
for  a  frame  every  672  ns.  With  256B  frames  there  are 
only  452,951  fps  or  a  frame  every  2207  ns.  On  a  2GHz 
machine  (e.g.,  the  AMD  Opteron  on  which  FShm  was 
evaluated)  this  is  the  difference  between  1338  processor 
cycles  and  4397  cycles. 

The  smaller  the  frame  size,  the  more  difficult  it  is 
to  support  wire-rate,  since  the  frame-rate  is  consider¬ 
ably  higher,  even  though  the  total  bandwidth  utilized  is 
less  due  to  constant  per-frame  overhead  on  the  network. 
Contrast  the  1, 488, 095  fps  at  761  Mbps  for  64B  frames 
versus  the  meager  81,  274  fps  at  987  Mbps  for  1518B 
frames.  Thus,  the  subsequent  discussion  on  system  bot¬ 
tlenecks  focuses  on  the  64B  frame-size  with  the  implicit 
understanding  that  if  a  system  can  handle  64B  frames  it 
can  easily  handle  larger  frame  sizes. 


2.3  Operating  System  Bottlenecks 

To  understand  the  Operating  System  bottlenecks  in  frame 
processing,  it  is  useful  to  examine  the  application  flow 
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Receive  Transmit 


Figure  1:  Call  chain  for  the  Berkeley  Packet  Filter 


through  both  kernel  and  user-space  for  a  canonical  net¬ 
work  processing  system.  Figure  1  shows  the  flow  of 
the  send  and  receive  operation  in  the  Berkeley  Packet 
Filter  (BPF)  [23]  network  application  as  implemented 
in  FreeBSD,  which  is  representative  of  common  user- 
space  network  processing.  Here,  the  application  must  go 
through  six  or  more  levels  to  send  or  receive  a  packet.  On 
the  receive  side  libpcap[20]  is  typically  used  to  read  from 
the  BPF  interface  and  provide  a  cross-platform  frame  in¬ 
terface.  Libpcap  invokes  the  read  system-call  on  the  BPF 
device  node.  In  the  kernel,  the  VFS  layer  translates  this 
call  to  a  read  from  the  kernel-space  BPF  buffer.  This  read 
operation  invokes  uiomove  which  then  copies  the  buffer 
into  user-space.  At  interrupt  time,  when  the  DMA  trans¬ 
fer  of  new  frame  data  from  the  network  interface  card 
(NIC)  is  complete,  the  BPF  layer  copies  the  data  from 
the  driver  into  the  BPF  buffer.  Send  operates  in  reverse 
order.  The  application  calls  write  on  the  network  device, 
which  invokes  the  kernel.  The  VFS  layer  translates  this 
into  a  call  to  the  BPF  subsystem  which  uses  uiomove  to 
copy  the  user  data  into  kernel-space.  From  here,  the  data 
is  passed  to  the  Ethernet  driver  which  then  DMAs  the 
data  to  the  NIC. 

These  mechanisms  require  several  copies  (the  bcopy 
and  the  uiomove),  a  system-call,  and  passage  through 
the  VFS  layer.  Furthermore,  since  the  BPF  buffers  are 
accessed  simultaneously  by  the  Ethernet  driver  and  the 
BPF  kernel  code,  a  lock  must  be  acquired  and  released 
before  the  BPF  buffers  can  be  accessed.  The  three  main 
Operating  System  bottlenecks  in  this  process  are  the  sys¬ 
tem  calls,  lock  acquisition,  and  the  VFS  layer.  These  are 


Note:  For  VCPU=0,  no  scheduler  activations  are  used;  scheduler  acti¬ 
vations  result  in  reduced  performance  even  for  single  thread  tests. 


Table  1:  System-call  and  ioctl  costs  on  an  AMD  Opteron. 

discussed  below.  (Overhead  due  to  copies,  which  is  pri¬ 
marily  memory  overhead,  and  I/O  costs  are  covered  in 
Section  2.4.) 

2.3.1  System  Calls 

For  most  applications,  the  cost  of  system  calls  are  irrel¬ 
evant  as  they  occur  relatively  infrequently  compared  to 
other  application  operations.  However,  when  handling 
the  frame  rates  described  above,  they  quickly  overwhelm 
the  system’s  ability  to  process  frames  resulting  in  a  sig¬ 
nificant  number  of  dropped  frames.  Table  1  quantifies 
the  costs  of  performing  system  calls  on  a  dual  processor 
dual  core  2.0GHz  AMD  Opteron  270,  running  FreeBSD 
5.5  (the  fastest  configuration  we  had  available,  according 
to  our  measurements).  The  “#  VCPU”  column  indicates 
the  number  of  processors  available  to  the  application  (as 
defined  by  Scheduler  Activations  [1]),  and  the  “#  Thr.” 
column  indicates  the  number  of  concurrent  threads  in  the 
process  for  which  the  system-call  time  was  measured. 
Notice  that  the  Opteron  system  is  not  able  to  perform 
even  a  single  system  call  and  remain  within  the  672  ns 
service  time  for  a  single  64B  frame  on  a  saturated  Gi¬ 
gabit  Ethernet  link,  assuming  a  multi-threaded  environ¬ 
ment.  This  overhead  is  due  to  the  additional  complexity 
of  the  user-space  scheduler  provided  by  scheduler  acti¬ 
vations. 

Most  recent  prior  work  uses  shared  memory  to  trans¬ 
fer  data  between  the  NIC  device  driver  and  the  net¬ 
work  processing  application  to  avoid  the  cost  of  system 
calls  [29,  21],  As  Section  3  describes,  this  is  the  approach 
used  by  FShm  to  eliminate  system  calls  for  the  transfer 
of  data  between  different  contexts,  be  it  between  a  ker¬ 
nel  and  user  context  or  multiple  user  contexts  (i.e.,  pro¬ 
cesses). 

2.3.2  Locking  Mechanisms 

In  order  to  prevent  concurrent  access  to  a  single  buffer 
(resulting  in  corrupted  data),  most  production  operating 
systems  use  mutexes  (also  known  as  locks).  Table  2  enu¬ 
merates  the  costs  of  mutex  operations  in  kernel-  and  user- 
space  for  the  same  AMD  Opteron  described  earlier.  The 
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Lib 

Lock  (ns) 

Unlock  (ns) 

Lock  &  Unlock  (ns) 

pthread 

75 

78 

159 

kernel 

28 

19 

27 

Table  2:  Mutex  operation  costs  for  an  AMD  Opteron 

minimum  cost  of  passing  a  variable  between  two  threads, 
assuming  no  cache  misses  or  contention  on  the  locks,  is 
double  the  cost  of  a  single  lock/unlock  pair  since  both  the 
reader  and  writer  must  lock  the  data  structure  when  per¬ 
forming  the  respective  operation.  This  implies  that  on 
the  order  of  318  ns  (for  user<->user  communication)  or 
54  ns  (for  kcrneb->kcrncl/user)  would  be  spent  passing  a 
single  variable  out  of  a  budget  of  672  ns  for  64B  frames; 
this  is  intolerable  in  the  former  case  and  less  than  ideal  in 
the  latter.  Note  that  these  numbers  are  best-case  because 
the  mutex  variable  is  already  in  the  processor  cache. 

One  could  aggregate  writes  to  amortize  the  cost  of 
a  lock-unlock  pair.  However,  the  costs  of  such  ag¬ 
gregation  must  still  be  addressed.  Section  3  describes 
how  FShm  eliminates  the  need  for  explicit  locking  by 
utilizing  concurrent  lock-free  queues  and  leverages  the 
cache  coherency  manager  to  implicitly  achieve  the  lock¬ 
ing  needed  to  ensure  data  consistency. 

2.3.3  VFS  Layer  Overhead 

The  indirection  of  reads  and  writes  through  the  VFS 
(filesystem)  layer  of  the  kernel  incurs  the  largest  over¬ 
head.  Table  1  shows  the  cost  of  a  null  ioctl  call  (thus 
capturing  just  the  overhead  of  the  system-call  and  VFS 
layer)  for  the  same  Opteron  system.  We  can  see  that  the 
cost  of  an  ioctl  can  be  up  to  triple  that  of  a  plain  system 
call.  Recall  that  system  calls  alone  overwhelmed  the  672 
ns  processing  windows  for  64B  Gigabit  Ethernet  frames, 
passing  through  the  VFS  layer  is  clearly  intolerable  for 
FShm.  Section  3  describes  how  FShm’s  shared-memory 
architecture  avoids  both  the  system-call  and  VFS  layer 
costs. 

2.4  Hardware  Bottlenecks 

While  the  hardware  plays  a  prominent  role  in  the  cost 
of  the  various  OS  bottlenecks  described  above,  there  are 
other  potential  bottlenecks  in  the  hardware  that  are  more 
difficult  to  control.  Of  these,  the  overall  system  I/O  ar¬ 
chitecture  and  memory  subsystem  play  the  most  signif¬ 
icant  role.  Since  it  is  difficult  for  software  techniques 
(such  as  FShm  and  much  of  the  prior  work)  to  mitigate 
these  bottlenecks,  one  should  bear  in  mind  that  they  may 
ultimately  limit  the  performance  of  any  system.  Fortu¬ 
nately,  FShm  is  still  able  to  handle  64  byte  frames  at  Gi¬ 
gabit  Ethernet  rates  for  the  Opteron  system  on  which  it 


Figure  2:  Host  Bus  Architecture 


Figure  3:  Switched  Architecture 


was  evaluated  (See  Section  4). 

2.4.1  System  Architecture 

To  understand  the  potential  bottlenecks  in  the  system  ar¬ 
chitecture,  consider  Figure  2  and  Figure  3.  In  the  host 
bus  architecture  (Figure  2),  the  first  potential  bottleneck 
is  the  shared  communication  bus  between  the  two  pro¬ 
cessors.  Since  access  to  memory,  access  to  snoop  results 
from  other  processors,  and  access  to  the  PCI  bus  all  con¬ 
tend  for  the  same  interface,  it  can  become  a  bottleneck. 

Even  if  the  processor  front-side  bus  is  not  a  problem, 
the  internal  organization  of  the  North  Bridge  can  limit 
performance.  For  example,  it  may  be  the  case  that  a 
PCI  device  on  one  bus  cannot  DMA  to  memory  at  the 
same  time  that  a  processor  wishes  to  deliver  a  command 
to  the  other  PCI  bus.  Other  issues  in  the  North  Bridge 
design,  such  as  interrupt  scheduling  and  command  la¬ 
tency,  can  also  have  tremendous  impact  on  total  achiev¬ 
able  throughput  to  NICs  on  the  PCI  bus. 

Consider  the  data  in  Table  3  which  compares  the  num- 


System 

NIC 

PCI 

Bus 

Frame 

Size 

Max 

Frames/s 

1.0  GHz 

PHI 

Intel 

82545  GM 

64-bit 

66  MHz 

64B 

811  kfps 

2.66  GHz 

P4  Xeon 

Intel 

82545  GM 

64-bit 

133  MHz 

64B 

723  kfps 

Table  3:  Throughput  on  a  Pentium  III  &  Pentium  4  Xeon. 
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ber  of  minimum  size  Ethernet  frames  that  can  be  gener¬ 
ated  on  a  Pentium  III  system  and  Pentium  4  Xeon  system 
with  Linux  pktgen  [21].  Notice  that  the  Pentium  III  sys¬ 
tem  has  a  slower  PCI  bus,  is  a  slower  processor,  and  yet 
can  place  frames  on  the  network  faster  than  the  Pentium 
4  Xeon  system  (our  numbers  confirm  this).  However, 
the  Pentium  III  system  contains  a  high-end  server  moth¬ 
erboard  that  utilizes  the  ServerWorks  HE-SL  chipset, 
whereas  the  Pentium  4  Xeon  has  a  standard  motherboard. 
Here,  the  chipset  makes  all  the  difference. 

The  switched  system  architecture  shown  in  Figure  3  is 
an  improvement  over  the  bus-based  architecture  in  that 
multiple  independent  data  transactions  need  not  inter¬ 
fere.  For  example,  depending  on  the  network  topology 
in  the  switched  cloud,  it  is  possible  for  a  processor  to 
access  main-memory  while  another  processor  services  a 
cache  snoop  request.  This  reduces  the  contention  for  lim¬ 
ited  system  resources,  however,  bottlenecks  may  still  ex¬ 
ist  since  the  switched  interconnect  network  is  not  a  full 
crossbar.  The  AMD  Opteron  system  on  which  FShm  is 
evaluated  is  a  switched  architecture. 

2.4.2  Memory  Latency 

Another  major  obstacle  to  high-performance  frame  pro¬ 
cessing  is  the  discrepancy  between  processor  clock  rates 
and  main  memory.  It  can  take  hundreds  of  CPU  clock 
cycles  to  fetch  a  single  cache  line  from  the  memory  sub¬ 
system  (translating  to  50-100  ns).  Furthermore,  modern 
system  architectures  make  this  penalty  difficult  to  avoid. 
DMA  transfers  from  commodity  NICs  always  write  to 
main  memory  forcing  the  processor  to  fetch  data  from 
main  memory.  Since  cache  line  size  is  on  the  order  of 
frame  size  for  small  frames,  each  frame  requires  a  main 
memory  transaction.  Some  of  this  overhead  can  be  hid¬ 
den  through  prefetching  [7]  and  other  techniques,  but  the 
tight  time  bound  for  processing  64B  frames  at  gigabit 
rates,  limits  the  effectiveness  of  these  techniques. 

3  FShm  Design 

Meeting  the  design  goals  enumerated  in  Section  2.1  is 
difficult  given  the  overheads  in  the  existing  frame  socket 
interfaces.  Previous  work  has  reduced  the  per-frame 
costs  by  minimizing  this  overhead  via  aggregated  trans¬ 
fers  or  “safe”  zero-copy  semantics.  While  these  tech¬ 
niques  minimize  almost  all  the  overhead,  the  per-frame 
non-application  processing  time  still  exceeds  the  672  ns 
available  for  64B  frames  on  a  saturated  Gigabit  Ethernet 
link. 

This  section  describes  the  design  of  the  FShm  archi¬ 
tecture.  FShm  is  capable  of  processing  64B  frames  on 
a  saturated  Gigabit  Ethernet  link  with  no  drops.  FShm 


Figure  4:  BPF  Receive  (left)  and  FShm  receive  (right) 

concedes  that  the  prior  work  has  done  as  much  as  possi¬ 
ble  to  reduce  per-frame  processing  overhead  on  a  single 
processor  system.  Instead  of  targeting  overhead,  FShm 
uses  the  concept  of  pipelining  (extremely  common  in 
hardware  design)  to  distribute  the  work  for  each  frame 
across  multiple  processors  in  software.  This  allows  in¬ 
creased  frame  throughput  without  the  need  to  reduce  per- 
frame  processing  overhead.  In  fact,  by  increasing  the 
number  of  pipeline  stages,  FShm  can  expand  the  avail¬ 
able  time  to  process  each  frame  by  increasing  frame  la¬ 
tency,  but  still  maintaining  frame  throughput. 

To  understand  the  design  of  FShm,  this  section  first 
discuss  why  it  is  unlikely  that  per-frame  processing  time 
can  be  reduced  on  a  single  processor  machine.  The  sec¬ 
tion  then  introduces  the  pipeline  organization  used  in 
FShm  to  improve  throughput  and  increase  available  pro¬ 
cessing  time,  even  though  per-frame  processing  overhead 
is  not  reduced.  The  section  concludes  by  describing  how 
the  inter-stage  (i.e,  processor  to  processor)  communica¬ 
tion  mechanism  is  designed  to  avoid  lock-unlock  and 
system-call  overhead. 

3.1  Optimizing  the  Frame  Socket  Interface 

To  understand  why  single-processor  per-frame  process¬ 
ing  overhead  is  unlikely  to  be  reduced,  it  is  instructive 
to  examine  the  bottlenecks  in  the  frame  socket  interface 
(described  earlier  in  Section  2)  and  examine  how  prior 
work  mitigates  these  costs.  To  recap,  the  critical  bottle¬ 
necks  are  the  system  calls,  navigating  the  VFS  layer,  the 
double  copies  (i.e.,  the  bcopy  and  the  uiomove),  and  the 
locking  between  the  interrupt  handler  and  the  high-level 
of  the  kernel,  as  shown  in  Figure  4. 

Implementations  of  the  BPF  attempt  to  minimize  the 
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impact  of  the  copies,  locking,  and  system  calls  by  ag¬ 
gregating  multiple  frames  worth  of  data  into  a  single 
buffer  and  returning  the  entire  block  in  a  single  system 
call.  By  using  two  buffers,  the  lock  contention  between 
the  interrupt  handler  and  the  kernel  is  minimized  since 
the  lock  only  needs  to  be  acquired  when  swapping  the 
buffers  or  copying  the  contents  to  user-space.  Unfortu¬ 
nately  while  all  the  bottlenecks  are  minimized,  they  are 
not  eliminated.  By  design  the  bcopy  in  the  interrupt  han¬ 
dler  cannot  be  eliminated  or  minimized  as  the  system  is 
designed  to  forward  the  frame  to  the  network  stack  in  ad¬ 
dition  to  capturing  it.  Monitor  mode  prevents  the  frame 
from  traveling  the  network  stack,  however,  the  bcopy  is 
still  necessary  to  maintain  the  semantics  of  the  higher- 
level  API. 

Other  prior  work  focuses  on  eliminating  the  over¬ 
head  of  the  uiomove  operation  that  copies  frames  from 
kernel-to  user-space  as  copies  can  be  a  significant  bottle¬ 
neck  [6,  5,  18].  These  techniques  provide  loose  copy  se¬ 
mantics,  which  is  all  that  is  necessary  for  high-bandwidth 
(vs.  high  frame-rate)  applications.  Additional  work 
eliminated  the  copies  by  memory  mapping  a  shared  re¬ 
gion  of  kernel  memory  into  the  application’s  address 
space  [21].  These  solutions  minimize  or  eliminate  the 
cost  of  copying  but  retain  the  overhead  of  the  system 
calls  and  the  VFS  layer. 

Alternative  approaches  focus  on  minimizing  the  ker¬ 
nel  overhead  by  employing  NICs  capable  of  DMAing 
frames  directly  into  the  address  spaces  of  interested 
applications  [12,  30].  This  is  done  by  mapping  the 
NIC’s  buffer  rings  directly  into  an  application’s  address 
space  [9]  or  by  permitting  user-space  DMA. 

The  approaches  described  above  completely  elimi¬ 
nate  buffer  copying  via  shared  pages  across  the  user- 
space/kernel-space  boundary,  and  completely  eliminate 
system  calls  and  VFS  overhead  by  DMAing  directly  into 
user-space.  However,  none  techniques  can  eliminate  the 
overhead  of  DMA  transfers  and  other  memory  accesses. 
Furthermore,  none  of  these  techniques  allow  Gigabit  rate 
processing  of  small  frames,  even  though  overhead  is 
minimized.  As  a  result,  the  real  non-application  over¬ 
head  involved  in  processing  frames  on  commodity  hard¬ 
ware  exceeds  the  service  time  for  small  frames  at  gigabit 
rates. 

Therefore,  we  believe  it  is  necessary  to  develop  tech¬ 
niques  that  can  improve  throughput  without  requiring 
further  reductions  in  the  per-frame  processing  time  (in¬ 
cluding  overhead). 


3.2  The  FShm  Pipeline 

As  shown  in  Figure  4,  the  design  of  FShm  eliminates  the 
previously  described  overheads.  The  key  to  FShm’s  abil¬ 
ity  to  process  small  frames  at  gigabit  rates  comes  from 
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Figure  5:  The  Basic  FShm  Pipeline 


its  use  multiple  processors  (or  multiple  cores  on  multi¬ 
core  systems)  to  pipeline  per-frame  work.  This  improves 
frame  throughput  without  sacrificing  latency 

Figure  5  depicts  an  overview  of  the  FShm  processing 
pipeline.  From  the  diagram,  we  see  that  each  frame  takes 
T  nanoseconds  to  process.  From  the  previous  discussion, 
we  know  that  T  >  672 ns,  the  inter-frame  arrival  period 
for  64-byte  frames  on  a  saturated  Gigabit  Ethernet  link. 
While  we  cannot  reduce  this  processing  time,  it  is  pos¬ 
sible  to  improve  the  throughput  of  processing  by  using 
multiple  processors  as  shown  in  the  figure. 

In  the  general  case,  the  processing  of  each  frame  is  di¬ 
vided  into  at  least  three  stages.  Input  processing  (IP),  Ap¬ 
plication  processing  (APP),  and  Output  processing  (OP). 
There  are  degenerate  cases  where  there  is  no  IP  phase 
(e.g.,  frame  generation)  and  where  there  is  no  OP  phase 
(e.g.,  frame  capture).  Each  stage  is  executed  as  a  thread 
on  a  separate  processor.  Ignoring  communication  over¬ 
head,  and  assuming  an  ideal  division  of  labor,  we  see  that 
the  processing  for  each  frame  can  be  divided  into  stages 
with  duration  T/ 3  nanoseconds.  Thus,  each  frame  still 
takes  T  nanoseconds  to  go  from  input  to  output.  This 
agrees  with  the  earlier  assertion  that  per-frame  process¬ 
ing  time  cannot  be  reduced.  However,  by  dividing  the 
work  across  3  processors,  it  is  possible  to  initiate  pro¬ 
cessing  on  a  new  frame  every  T/3  nanoseconds,  tripling 
the  frame  throughput.  Because  there  is  still  only  one 
processor  handling  input  for  the  NIC(s),  and  another  for 
the  output,  no  mutual  exclusion  is  needed  to  manage 
the  resources  because  there  is  no  contention.  In  short, 
we  parallelize  frame  processing  without  introducing  con¬ 
tention. 

In  reality,  communication  overhead  is  introduced  by 
splitting  work  into  stages  that  run  on  separate  processors. 
Furthermore,  the  IP,  OP,  and  APP  stages  may  not  each 
take  a  third  of  the  total  time.  Typically,  the  APP  stage 
will  be  much  more  time  consuming  than  IP  and  OP.  How¬ 
ever,  even  in  this  case  throughput  still  increases,  though 
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put_nonblock (...)  { 

if  (NULL  ==  queue [head] ) 
{ 

queue [head]  =  ptr; 
head  =  NEXT (head) ; 

} 


get_nonblock  (...)  { 

if  (NULL  !=  queue  [tail]) 
{ 

ptr  =  queue [tail]; 
queue [tail]  =  NULL; 
tail  =  NEXT (tail) ; 

} 


Figure  7;  CLF  Queue:  Put_nonblocking 


Figure  6:  The  FShm  Pipeline  with  multiple  APP  Units 


not  optimally.  It  is  possible  to  further  improve  through¬ 
put  in  this  unbalanced  scenario  by  utilizing  additional 
processors  to  allow  multiple  APP  stages  to  run  in  par¬ 
allel.  Effectively,  this  expands  the  amount  of  processing 
time  per-frame  without  affecting  throughput.  Figure  6 
shows  such  an  example  where  the  APP  stage  takes  twice 
as  long  as  either  IP  or  OP  and  cannot  be  divided  amongst 
processors.  Here  by  utilizing  a  single  extra  processor,  it 
is  still  possible  to  quadruple  the  frame  throughput  versus 
the  original  single  processor  solution.  Again,  this  hap¬ 
pens  without  shortening  the  per-frame  processing  time. 

This  pipeline  structure  and  careful  implementations  of 
inter-process  communication,  allows  FShm  to  process 
frames  at  Gigabit  Ethernet  rates,  even  for  small  frames. 
With  shared  memory  inter-process  communication  be¬ 
tween  kernel  and  user  space,  pipeline  stages  can  occur 
in  either  in  kernel-space  or  in  user-space.  In  FShm,  we 
keep  the  IP  and  OP  phases  are  kept  in  kernel  space  to 
allow  sharing  of  resources  amongst  applications. 

3.3  CLF  Queues 

Though  Figures  5  and  6  show  constant  durations  for  each 
pipeline  stage,  in  reality  execution  time  may  vary.  To 
accommodate  this  variation,  FShm  passes  frames  from 
one  stage  to  the  next  using  point-to-point  queues.  Each 
pipeline  stage  writes  to  one  or  more  output  queues  and 
reads  from  a  single  input  queue.  Absent  careful  design¬ 
ing,  the  overhead  of  queuing  and  dequeuing  data  from 
could  easily  wipe-out  all  throughput  gains. 

In  order  to  minimize  the  overhead  of  communi¬ 
cating  among  pipeline  stages  (in  particular  the  lock¬ 
ing  overhead),  FShm  uses  concurrent  lock-free  (CLF) 
queues  [25],  Typically,  these  queues  require  no  locking 
provided  that  the  hardware  supports  an  atomic  memory 


exchange  instruction.  However,  FShm  uses  only  point- 
to-point  queues  with  a  single  thread  reading  and  writing 
each  queue.  In  this  special  case,  proper  queue  semantics 
can  be  guaranteed  using  the  hardware’s  cache  coherence 
mechanism  rather  than  expensive  high-level  mutual  ex¬ 
clusion  primitives  (see  implementation  notes  below). 

Notice  that  point-to-point  only  CLF  queues  in  FShm 
also  imply  that  there  is  no  high-level  contention  in  the 
queues.  Just  as  there  is  a  single  read  stage  for  the  input 
NIC  and  a  single  write  stage  for  the  output  NIC  to  avoid 
contention,  each  queue  only  has  a  single  writer  and  a  sin¬ 
gle  reader. 

At  the  low-level,  there  may  be  contention  for  the 
cache-lines  that  contain  the  queue  entries.  Ideally,  if 
there  are  two  active  processors  operating  on  a  single 
queue,  as  in  FShm,  they  will  be  processing  queue  entries 
separated  by  more  than  a  single  cache  line,  eliminating 
the  overhead  of  cache  coherence  due  to  this  contention. 
In  practice,  we  demonstrate  excellent  performance  even 
without  cache-line  separation. 

More  advanced  mechanisms  such  as  transactional 
memory  architectures  [15]  may  be  able  to  mask  the 
cache  coherence  time  when  two  processors  are  shar¬ 
ing  a  single  cache  line.  Furthermore,  the  transactional 
semantics  allow  low-overhead  multipoint-to-multipoint 
queues,  which  can  be  advantageous. 

The  biggest  advantage  of  CLF  queues  in  FShm  is  seen 
when  performing  cross-domain  messaging  (e.g.  process 
<-►  process  and  process  <-►  kernel).  In  traditional  message 
passing  systems  one  would  need  to  perform  an  expensive 
operation  to  synchronize  the  two  processors  or  spin  until 
it  is  safe  to  proceed.  Neither  technique  is  acceptable  for 
the  rates  possible  on  modern  networks.  This  does  not 
happen  with  CLF  queues. 
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3.3.1  CLF  Queue  Implementation  Notes 

The  implementation  of  CLF  queues  creates  a  per-queue 
shared  memory  region  containing  an  array  and  a  shared 
control  region  that  that  manifest  as  a  single  device  node 
to  user-space  applications.  Utilizing  device  nodes  pro¬ 
vides  a  simple  API  that  makes  it  easy  to  manage  device 
specific  calls  and  track  misbehaving  applications  that  for¬ 
get  to  disconnect  from  a  queue,  without  modifying  kernel 
data  structures.  Recall  from  Section  2  that  VFS  and  ioctl 
calls  are  prohibitively  expensive.  However,  by  FShm’s 
design,  these  are  only  accessed  on  startup  and  shutdown. 

Both  the  shared  array  and  control  regions  are  mapped 
read/write  as  there  is  a  small  amount  of  shared  state.  In 
addition  there  is  a  non-shared  memory  version  that  pro¬ 
vides  a  process-local  thread  <->  thread  communication 
mechanism  suitable  for  user-space  inter-thread  commu¬ 
nication. 

The  main  implementation  challenge  for  CLF  queues 
is  in  ordering  operations  such  that  they  only  rely  on 
memory  coherence,  not  the  memory  consistency  model 
(which  varies  across  processors)  or  high-level  mutual  ex¬ 
clusion  primitives.  Figure  7  shows  pseudo-code  for  the 
non-blocking  put  and  get  operations  using  the  shared  re¬ 
gions.  This  code  works  with  all  processor  consistency 
models,  as  only  coherence  is  necessary  for  mutual  exclu¬ 
sion.  To  see  this,  recall  that  the  queue  is  point-to-point 
so  there  is  only  a  single  thread  calling  put  and  a  single 
thread  calling  get.  Furthermore,  put  will  only  write  to 
the  shared  region  when  the  queue  node  is  NULL  and  get 
will  only  write  the  node  when  the  value  is  not  NULL. 
Finally,  since  stores  are  atomic  on  all  coherent  memory 
systems,  the  getter  and  putter  will  never  update  the  same 
node  simultaneously. 

Pipeline  stages  can  dynamically  switch  to  using  block¬ 
ing  operations  permitting  process  sleep  during  times  of 
reduced  load. 

3.4  BSD  Implementation  Notes 

Given  a  set  of  CLF  queues,  implementing  a  shared  buffer 
region  that  is  compatible  with  the  existing  BSD  mbuf 
message  passing  system  is  straightforward.  First  a  struc¬ 
ture  is  defined  that  contains  an  mbuf  as  the  first  item, 
permitting  us  to  typecast  between  the  two  as  needed;  we 
are  breaking  strict  type  rules  so  the  programmer  needs 
to  be  careful.  Additionally,  we  store  the  kernel  address 
of  the  structure  and  a  pointer  to  the  kernel  address  of  its 
associated  data  buffer.  The  data  buffer  is  functionally  the 
same  as  the  one  the  mbuf  cluster  allocator  would  create 
for  a  normal  mbuf  when  allocating  a  buffer  for  a  network 
interface.  For  convenience,  our  data  buffers  are  set  to 
2048B  easily  preserving  page  alignment  for  the  network 
interfaces.  We  track  kernel  addresses  as  they  are  constant 


for  all  applications;  presently  FreeBSD  does  not  export 
an  API  to  make  kernel  virtual  addresses  directly  accessi¬ 
ble  by  a  user-space  application. 

3.5  Network  Interfaces 

Given  the  general  purpose  nature  of  the  FShm  architec¬ 
ture,  the  integration  of  the  network  interface  is  not  a  pri¬ 
mary  concern.  A  variety  of  different  high-performance 
interconnection  architectures  are  possible.  We  chose 
to  slightly  modify  the  stock  drivers  in  FreeBSD  for 
the  Broadcom  5703  &  5704  and  Intel  1000/Pro  Ether¬ 
net  network  interfaces  and  attach  the  FShm  architec¬ 
ture  in  the  kernel.  Alternative  mechanisms  could  uti¬ 
lize  user-space  DMA  or  mapping  the  interface’s  trans¬ 
mit  and  receive  descriptor  rings  to  the  application  [9], 
We  chose  to  manage  our  interfaces  in  the  kernel  to 
demonstrate  kernel  ^process  communication.  Addition¬ 
ally,  FreeBSD  lacks  an  API  to  pin  user-space  threads 
which  is  needed  for  maximum  performance  with  a  user- 
space  interface. 

To  manage  the  network  interfaces  we  implemented 
two  additional  kernel  modules  to  handle  receiving  (IP) 
and  transmitting  (OP).  These  modules  each  are  capa¬ 
ble  of  mapping  one  or  more  interfaces  to  one  or  more 
queues.  A  more  detailed  description  of  these  modules 
can  be  found  in  Sections  4.3.1  &  4.3.2. 

Presently  the  modules  purely  poll  their  inputs  for 
work,  although  they  could  also  be  implemented  with  a 
hybrid  interrupt  and  polling  organization  to  conserve  sys¬ 
tem  resources  and  prevent  livelock  [26].  We  chose  not  to 
implement  this  as  we  are  demonstrating  the  limits  of  the 
system  where  every  processor  will  be  utilized  at  100%, 
and  therefore  interrupts  would  never  be  activated. 

3.6  Safety 

Safety  is  a  critical  concern  when  sharing  memory  and 
passing  memory  references  between  processes.  Any  pro¬ 
cess  at  any  time  can  misbehave  and  write  invalid  and  or 
inconsistent  values  into  the  memory  so  that  a  correctly 
behaving  process  reads  it  and  does  the  wrong  thing.  Ex¬ 
amples  of  errors  are  writing  an  invalid  address  causing 
the  reading  process  to  examine  the  wrong  memory  and 
failing  to  ensure  group-write  operations  are  atomic  in  na¬ 
ture.  In  its  evaluated  form,  FShm  assumes  that  applica¬ 
tions  sharing  the  FShm  memory  region  are  not  malicious 
and  makes  no  serious  effort  to  protect  itself  again  misbe¬ 
having  participants.  This  is  a  fair  assumption  as  an  ap¬ 
plication  could  be  written  as  a  single  large  multi-threaded 
application  with  user-space  access  to  the  network  drivers 
where  a  fault  would  still  terminate  the  application. 

This  does  not  imply  that  security  was  not  a  concern  in 
the  design  of  the  system  and  several  steps  were  taken 


to  architect  the  system  so  that  it  could  protect  itself 
when  necessary.  First,  access  to  the  queues  is  performed 
though  device  entries  and  thus  gains  the  underlying  se¬ 
curity  model  maintained  by  the  host  operating  system. 
Second,  the  queues  are  in  a  separate  region  from  the  sbuf 
regions  and  the  sbuf  regions  are  split  into  an  sbuf  region 
and  a  data  region.  This  division  makes  it  possible  for  the 
kernel  to  share  the  sbuf  header  as  read-only  with  user- 
space  applications  preventing  the  applications  from  cor¬ 
rupting  the  sbuf  header  itself  (e.g.  corrupting  data  point¬ 
ers  or  the  reference  count  field).  This  permits  the  appli¬ 
cation  to  only  manipulate  the  frame  data  or  the  reference 
in  the  queue.  Corruption  of  the  frame  data  is  not  a  con¬ 
cern  for  the  kernel  as  it  does  not  process  it.  However, 
corrupting  the  sbuf  reference  in  the  queue  is  a  concern 
and  can  be  dealt  with  by  checking  the  address  against  the 
known  bounds  of  the  sbuf  region.  Premature  processing 
of  a  frame  in  the  kernel  is  also  not  a  concern  for  the  ker¬ 
nel  itself  as  the  worst  case  scenario  is  transmission  of  a 
garbage  frame  resulting  in  application  failure  but  not  sys¬ 
tem  failure.  Bounds  checking  can  easily  be  performed 
in  the  available  time  for  processing  at  any  frame  size  as 
all  the  necessary  information  will  be  resident  in  the  first 
level  cache  or  registers. 

Recovering  from  leaked  buffers  is  the  only  difficult 
problem  for  the  application  to  deal  with  and  the  only  rea¬ 
sonable  recovery  mechanism  is  to  restart  the  entire  ap¬ 
plication.  Selective  restarting  of  application  components 
may  be  possible  for  certain  applications.  However,  the 
necessary  bookkeeping  adds  significant  overhead. 

4  Evaluation 

This  section  presents  an  evaluation  of  FShm. 

4.1  Evaluation  Criteria 

To  measure  the  effectiveness  of  FShm,  three  metrics  are 
used:  (1)  frame  latencies,  both  end-to-end  and  in-system, 

(2)  wall-clock  time  available  for  frame  processing,  and 

(3)  frame  drop  rates. 

Each  of  these  quantities  is  measured  in  frame  gener¬ 
ation,  frame  capture,  and  frame  forwarding  configura¬ 
tions,  under  a  variety  of  frame  sizes.  The  scenarios  will 
be  described  in  more  detail  below.  Drop  rates  are  mea¬ 
sured  by  comparing  how  many  frames  should  have  been 
processed  to  the  number  of  frames  that  were  actually 
processed.  Wall-clock  time  for  actual  work  is  measured 
by  monitoring  the  drop  rate  as  the  duration  of  a  dummy 
work  loop  is  steadily  increased.  The  time  taken  by  the 
dummy  work  loop  is  measured  using  fine-grain  timing 
instructions  (rdtsc  on  x86  platforms)  and  the  duration 
of  the  longest  work  loop  that  results  in  no  increase  in  the 


Figure  8:  Frame  Latencies 


drop-rate  gives  the  wall-clock  time  for  performing  actual 
work  on  a  frame. 

The  two  frame  latency  measurements  are  illustrated 
in  Figure  8.  For  end-to-end  latency,  since  the  sender 
and  receiver  are  on  the  same  machine,  the  end-to-end 
frame  latency  is  measured  by  inserting  a  64-bit  time- 
stamp  (via  the  processor  cycle-accurate  TSC  register  on 
the  x86  platform)  when  the  frame  is  sent  and  compar¬ 
ing  to  the  time  the  frame  is  received.  Internal  frame- 
latency  is  computed  by  time-stamping  the  frame  when  it 
is  first  seen  in  software  on  the  processing  machine  and 
then  comparing  it  to  the  time  the  frame  leaves  software 
control.  For  generation  and  capture  of  frames,  we  were 
unable  to  measure  the  end-to-end  frame  latency  because 
the  clocks  (i.e.,  TSC  register)  of  the  frame  generator  and 
frame  receiver  are  not  synchronized. 

4.2  Evaluation  Platform 

The  network  testing  environment  consisted  of  two  AMD 
Opteron  systems  connected  with  either  the  on-board 
Broadcom  5704C  (dual  Gigabit  Ethernet),  off-board  HP 
NC7771  (Broadcom  5703),  or  an  Intel  82545  GM  Net¬ 
work  Interface  Controller.  The  AMD  Opteron  systems 
were  based  on  the  Tyan  Thunder  K8SR  (S2881)  mother¬ 
board  with  dual  Opteron  270  dual-core  processors  run¬ 
ning  at  2GHz.  The  relevant  chipsets  on  the  motherboard 
are  the  AMD-8131  (HyperTransport  PCI-X  Tunnel)  and 
the  AMD-8111  (HyperTransport  I/O  Hub). 

4.3  Evaluation  Modules 

Each  evaluation  scenario  uses  a  configuration  with  in¬ 
stances  of  four  conceptual  modules:  receiver,  transmitter, 
frame  generator,  and  handler.  Each  module  corresponds 
to  a  pipeline  stage  in  Section  3  and  runs  as  a  thread  on 
a  dedicated  processor,  in  the  ideal  case.  The  details  of 
each  module  are  described  below. 

4.3.1  Receiver 

The  kernel  receiver  module  (KRx)  corresponds  to  the  IP 
stage  and  handles  the  task  of  moving  frames  from  the  in- 
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put  interfaces  to  the  associated  FShm  queues.  The  mod¬ 
ule  permits  every  NIC-to-queue  mapping  possible. 

The  receiver  performs  four  tasks  for  the  evaluation. 
First,  it  measures  the  average  amount  of  free  wall-clock 
time  per  frame  by  spinning  in  an  idle  loop.  This  gives 
a  measure  of  the  possible  duration  of  early  frame  pro¬ 
cessing  (e.g.  selecting  output  queues).  Second,  by  spin¬ 
ning  the  receiver  smooths  out  the  arrival  rate  of  frames 
to  make  frame  latency  calculations  more  accurate  by  re¬ 
ducing  bursting  due  to  batched  DMA.  Third,  the  receiver 
measures  the  time  it  takes  to  enqueue  the  frame  onto  its 
output  queue.  From  these  two  measures  we  can  compute 
the  amount  of  time  spent  polling  the  input  interfaces.  Fi¬ 
nally,  the  receiver  can  timestamp  frames  so  that  the  trans¬ 
mitter  can  measure  in-system  latency.  There  is  also  a 
specialized  receiver  module  that  computes  the  average 
frame  round-trip  time. 

4.3.2  Transmitter 

The  kernel  transmitter  module  (KTx)  corresponds  to  the 
OP  pipeline  stage.  It  handles  the  task  of  moving  frames 
from  a  set  of  incoming  FShm  queues  to  their  associated 
network  interfaces.  Again,  the  module  is  designed  to 
permit  every  queue  to  NIC  mapping  possible.  The  map¬ 
ping  of  multiple  queues  to  separate  network  interfaces  is 
demonstrated  in  Section  4.6. 

The  transmitter  performs  two  tasks  for  the  evaluation. 
It  optionally  timestamps  frames  just  before  it  enqueues 
them  onto  its  transmit  interface  permitting  round-trip  la¬ 
tency  measures  as  seen  by  the  receiver  module.  In- 
system  latency  is  also  measured  by  comparing  the  cur¬ 
rent  time  to  the  receiver  time-stamp  when  a  frame  is 
about  to  be  enqueued  to  its  output  network  interface. 

4.3.3  Generator 

Frame  generation  is  performed  by  three  different  mod¬ 
ules:  KSendl,  KSend,  and  USend.  KSendl  is  a  ded¬ 
icated  kernel  module  that  generates  a  single  frame  in  a 
buffer  and  enqueues  that  same  buffer  for  every  frame 
transmitted.  This  avoids  the  latency  in  flushing  the  frame 
from  the  cache  to  main  memory  so  the  network  card  can 
DMA  the  frame  without  stalling.  The  remaining  two 
modules  (KSend  and  USend)  operate  by  dequeuing  an 
sbuf  from  the  allocator  queue,  populating  its  data  buffer 
with  a  frame,  and  enqueuing  it  onto  the  output  queue 
for  the  transmitter  module  to  output.  KSend  operates  in 
kernel-space  and  USend  operates  in  user-space.  KSend 
and  USend  implement  degenerate  case  APP  stages. 

4.3.4  Handler 

The  handler  module  is  a  simple  APP  stage  whose  pri¬ 
mary  task  is  to  evaluate  how  much  wall-clock  work 
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Figure  9:  Setup:  Generation 


time  is  possible.  We  have  implemented  versions  of  the 
module  to  operate  in  both  kernel-space  (KHandle)  and 
user-space  (UHandle)  permitting  direct  evaluation  of 
the  user-space  scheduler.  The  effect  of  the  user-space 
scheduler  could  be  eliminated  with  process  pinning,  but 
FreeBSD  does  not  support  this.  The  handler  is  informed 
of  the  total  per-frame  service  time  allowing  it  to  identify 
time  taken  by  the  scheduler  in  the  absence  of  pinning. 

4.4  Generation 

The  generation  scenario  measures  how  well  FShm  allows 
kernel-space  and  user-space  code  to  generate  frames  for 
transmission.  For  this  set  of  experiments,  one  Opteron 
system  (the  evaluation  target)  running  FShm  is  used  as  a 
frame  generator.  The  NIC  is  connected  to  a  dummy  NIC 
that  will  sink  frames  at  whatever  rate  they  are  generated. 

The  high-level  arrangement  of  modules  on  the  evalu¬ 
ation  target  is  shown  in  Figure  9.  On  the  left,  we  have 
the  full  generation  setup  using  FShm.  Here,  the  USend 
or  KSend  module  pulls  a  free  buffer  (an  sbuf  in  partic¬ 
ular)  from  the  allocation  queue,  fills  the  buffer  and  then 
enqueues  the  sbuf  for  the  KTx  module.  The  KTx  mod¬ 
ule  deallocates  the  buffer,  sends  the  frame  and  adds  the 
now  free  sbuf  back  to  the  allocation  queue.  Each  queue  is 
an  FShm  CLF  queue  that  delivers  frames  in-order.  KTx 
transmits  all  frames  in-order.  The  right  of  the  figure 
shows  a  simplified  frame  generation  configuration  that 
is  used  to  test  the  maximum  transmission  rate  supported 
by  the  hardware.  Note  that  KSendl  communicates  di¬ 
rectly  with  the  device  driver  and  no  other  communication 
is  needed. 

Figure  10  depicts  the  results  for  the  frame  generation 
tests.  The  bar  labeled  Theoretic  Max  shows  the  maxi¬ 
mum  frame  rate  supported  by  Gigabit  Ethernet  for  the 
given  frame  size.  The  remaining  bars  show  the  actual 
transmission  rates  for  the  various  generation  configura¬ 
tions.  The  bars  labeled  KSendl  are  for  the  configuration 
to  the  right  of  Figure  9  and  the  remaining  bars  are  for 
the  configuration  to  the  left.  Recall  that  a  U  prefix  signi¬ 
fies  a  user-space  sender  and  the  K  prefix  a  kernel-space 
sender.  The  graph  shows  generation  rates  using  a  Broad- 
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Figure  10:  Generation  Rates 


Figure  11:  Setup:  Capture 


com  5703 -based  NIC  on  both  the  Opteron’s  133  MHz 
PCI  bus  and  the  100  MHz  PCI  bus.  The  bars  labeled 
libdnet  serve  as  a  reference  and  show  the  frame  trans¬ 
mission  rates  for  a  frame  generator  based  on  libdnet. 

Notice  that  for  all  but  the  64  byte  frame  size,  FShm  is 
able  to  generate  frames  at  the  theoretic  maximum  for  Gi¬ 
gabit  Ethernet.  Clock  skew  accounts  for  the  few  bars  that 
are  slightly  above  theoretic  maximum.  For  small  frames, 
libdnet  is  not  even  close  the  theoretic  maximum.  For  64 
byte  frames,  even  KSendl  only  comes  to  within  95%  of 
theoretic  max  showing  that  the  hardware  is  the  limiting 
factor.  The  FShm  numbers  lag  the  hardware  limit  by  a 
bit,  most  likely  due  to  cache  coherence  issues. 

For  all  future  experiments  in  which  a  frame  genera¬ 
tor  is  needed  to  saturate  the  link,  an  Opteron  system  us¬ 
ing  KSendl  with  the  NIC  on  the  133  MHz  bus  is  used 
to  generate  64B  frames.  The  same  Opteron  and  bus  are 
used  with  KSend  to  generate  the  other  frame  sizes. 

4.5  Capture 

In  this  scenario  a  frame  generator  is  connected  to  an¬ 
other  Opteron  system  (the  evaluation  target)  configured 
to  capture  frames.  The  configuration  of  FShm  is  shown 
in  Figure  1 1 .  Again,  each  module  runs  in  its  own  thread, 
and  all  queues  are  FShm  queues.  Here,  the  KRx  module 
pulls  a  free  sbuf  off  the  allocation  queue,  fills  the  sbuf 
with  data  from  the  NIC  and  enqueues  the  sbuf  for  the 
Handler.  The  Handler  consumes  the  frame  and  returns 
the  now  free  sbuf  to  the  allocation  queue. 

Results  are  shown  in  Figure  12  which  breaks  down 
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Figure  12:  Capture  Work 


the  amount  of  time  available  for  work  in-handler  (work 
time),  the  time  spent  forwarding  the  frame  to  the  next 
queue  (output  time),  and  the  time  taken  to  read  the  in¬ 
coming  frame  (the  input  time).  It  is  important  to  no¬ 
tice  that  the  time  spent  handling  the  queues  remains  con¬ 
stant  in  both  the  kernel  and  user-space  handler  modules 
demonstrating  FShm’s  scalability.  Note  that  no  frames 
are  dropped  at  any  frame  size. 

Compare  this  to  BPF  capturing  the  same  generated 
frame  stream,  using  stock  FreeBSD  drivers  and  the 
Broadcom  5703  NIC.  Here,  the  drop  rate  exceeded  75% 
for  small  frames.  Furthermore,  only  99.95%  of  frames 
were  captured,  even  for  the  largest  frame  sizes. 

Observed  latencies  for  the  kernel-space  handler  are  all 
bounded  by  twice  the  available  service  time.  Since  there 
are  two  pipeline  stages  in  this  degenerate  scenario  (IP 
and  APP),  this  means  that  all  frames  are  processed  with¬ 
out  drops.  Although  user  latency  could  not  be  measured 
directly,  the  average  latency  can  be  bounded  by  experi¬ 
mentation  and  is  within  acceptable  limits.  This  was  eval¬ 
uated  by  sending  109  frames  (i.e.,  64  GB  of  traffic  and 
ensuring  there  were  no  drops.  With  eleven  minutes  of 
sustained  traffic  we  can  be  confident  that  this  throughput 
can  be  sustained  indefinitely. 

4.6  Forwarding 

For  the  forwarding  experiments,  a  frame  generator  sends 
frames  to  an  Opteron  system  (the  evaluation  target).  The 
frames  forwarded  by  the  evaluation  target  are  then  re¬ 
turned  to  the  frame  generator  so  that  it  may  compute  end- 
to-end  latency,  drop  rates,  and  round-trip  time. 

Initially  we  evaluated  the  performance  of  a  single 
thread  forwarding  frames  from  one  NIC  to  another  with- 
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Figure  13:  Setup:  Forwarding 
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Figure  14:  Setup:  Forwarding  64-byte 


out  the  FShm  architecture  and  found  a  non-zero  drop 
rate  at  every  frame  size,  indicating  that  one  thread  cannot 
keep  up  with  Gigabit  Ethernet. 

For  the  FShm  test,  the  configuration  of  the  evaluation 
target  is  shown  in  Figure  13.  Forwarding  is  evaluated 
with  exactly  the  same  setup  as  the  capture  evaluation, 
only  instead  of  the  handler’s  output  queue  being  the  al¬ 
locator,  it  is  connected  to  a  KTx  module  that  outputs  the 
frames  on  the  appropriate  NIC,  in-order.  The  in-system 
latency,  in  the  forwarding  setup,  must  be  bounded  by 
three  times  the  service  rate  for  each  frame  as  this  setup 
has  three  pipeline  stages.  The  external  latency  is  shown 
in  Table  4.  This  is  measured  by  taking  the  average  round 
trip  time  and  dividing  by  two.  Dividing  the  external  la- 
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-space 
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Size 
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Time  (ns) 
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64B 

N/A 
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96B 
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128B 

484701 
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664873 

282 

192B 

529621 
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650713 
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256B 

716337 
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1212474 
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384B 
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1896708 
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512B 

1353471 

160 

2178294 

257 

768B 

1737235 

138 

2750763 

219 
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2233380 

134 

3639886 

219 

1518B 

3207739 

131 

5239745 

214 

Note:  times  for  64B  frames  were  too  small  for  accurate  measurement. 


Table  4:  Forwarding  System  Latency 
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Figure  15:  Forwarding  Work 


tency  by  the  inter-frame  arrival  period,  we  can  confirm 
that  NIC  queuing  is  causing  the  high  latencies. 

There  is  some  difficulty  evaluating  forwarding  of  64- 
byte  frames.  Section  4.4  shows  that  there  is  an  hardware 
limitation  in  generating  frames  to  a  separate  buffer  at  the 
full  rate.  This  implies  that  to  forward  64  byte  frames  at 
the  maximum  arrival  rate,  we  need  to  use  multiple  out¬ 
put  NICs.  Therefore  we  used  both  output  ports  on  the 
Opteron’s  on-board  Broadcom  5704C  NIC  which  inter¬ 
nally  times  its  DMAs  to  avoid  interference  on  the  bus. 
Also,  a  side  effect  of  generating  at  the  95%  transmission 
rate  using  the  KSendl  module  is  that  it  is  impossible  to 
insert  a  timestamp  into  outbound  frames  to  measure  ex¬ 
ternal  latency.  Furthermore,  due  to  cache  miss  latencies 
in  the  KTx  and  KRx  modules,  it  was  not  possible  to  in¬ 
sert  timestamps  to  compute  in-system  latency.  To  com¬ 
pensate  for  the  absence  of  direct  measurement  we  used 
the  sustained  traffic  technique  from  Section  4.5. 

Figure  15  shows  the  amount  of  useful  work  that  the 
handler  can  perform  before  the  system  begins  to  drop 
frames.  The  bar  to  the  left  of  each  bar-pair  shows  the 
amount  of  work  (and  overhead)  that  can  be  done  in  the  IP 
stage  (i.e.,  the  receiver  module)  and  the  bar  on  the  right 
quantifies  the  work  in  the  APP  (i.e.,  the  handler  module) 
stage.  Notice  that  most  of  the  time  at  large  frame  sizes 
is  available  for  work.  Note  further  that  the  total  avail¬ 
able  work  time  is  more  that  twice  the  actual  inter-frame 
arrival  period,  verifying  that  FShm  does  indeed  permit 
time-expansion  through  pipelining. 

At  smaller  frame  sizes,  less  work  is  possible.  At  the 
64  byte  frame  size,  FShm  consumes  almost  all  avail¬ 
able  time  leaving  no  time  for  work.  However,  recall  that 
FShm  can  expand  work-time  with  additional  processors, 
it  can  still  allow  real  work  to  be  performed.  For  example. 
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the  handler  could  forward  frames  to  N  additional  han¬ 
dlers.  This  would  give  N  times  the  per-frame  processing 
time.  The  stream  would  then  be  reassembled  at  the  out¬ 
put  module.  While  existing  commodity  systems  lack  the 
processors  to  do  this,  we  have  demonstrated  that  the  han¬ 
dler  can  split  an  incoming  stream  of  64  byte  frames  with 
no  drops  by  testing  the  configuration  in  Figure  14.  With 
the  upcoming  quad  core  dual  processor  systems,  FShm 
will  be  more  than  able  to  do  zero-drop  in-order  process¬ 
ing  of  64  byte  frames  and  still  allow  actual  work  to  be 
accomplished,  even  though  the  original  configuration  in 
Figure  13  does  not. 

5  Related  Work 

As  outlined  in  the  prior  sections,  the  FShm  architecture 
builds  upon  a  rich  body  of  prior  work  by  synthesizing 
ideas  from  the  areas  of  operating  system  design  [3,  13, 
16,  22],  message  passing  [2,  6,  28,  11,  5,  30,  12,  18], 
networking  architecture  [8,  10,  14,  19,  23,  31],  and  high 
performance  computing[22,  25]. 

Of  these  works,  the  Synthesis  Kernel[22]  has  been 
most  influential.  The  Synthesis  Kernel  pervasively  em¬ 
ployed  concurrent  lock-free  queues  for  passing  messages 
between  tasks,  and  is  the  inspiration  for  many  of  the 
ideas  in  FShm’s  communication  mechanism.  Specifi¬ 
cally,  FShm  utilizes  the  single -producer/single-consumer 
array  based  queue  for  all  message-passing. 

Both  FShm  and  NetTap[4],  an  existing  system  to  per¬ 
mit  high-speed  user-space  processing  of  packets,  use 
CLF  queues  and  shared  memory  to  communicate  be¬ 
tween  kernel  and  user-space  contexts.  NetTap  focused  on 
providing  a  specific  API  that  was  optimized  for  the  single 
processor  machines  that  were  available  on  the  commod¬ 
ity  market  at  the  time.  They  achieve  satisfactory  results. 
FShm  concerns  itself  with  providing  a  general  purpose 
API  designed  to  efficiently  utilize  multi-processor  and 
multi-core  commodity  systems.  Furthermore,  given  the 
time  of  publication,  NetTap  was  evaluated  on  100  Mbps 
Ethernet  and  may  not  fare  well  on  modern  networks. 

Recently  Intel  has  been  working  on  extending  the  ar¬ 
chitectures  of  the  Virtual  Interface  Architecture  [12]  and 
InfiniBand  [17].  Their  goal  is  to  accelerate  the  pro¬ 
cessing  of  TCP/IP  packets  by  dedicating  a  processor  to 
act  as  a  TCP/IP  onloading  engine  [29].  This  processor 
would  preprocess  packets  and  deliver  them  to  applica¬ 
tions  through  a  set  of  queue  structures  with  an  interface 
optimized  for  TCP/IP.  However,  ETA  does  not  pipeline 
frame  processing  and  is  not  evaluated  on  small  frame 
transmissions.  Furthermore,  FShm  remains  agnostic  to 
the  application  whereas  ETA  is  TCP/IP  centric. 

Finally,  the  Click  Modular  Router[19]  is  a  network 
processing  architecture  with  goals  similar  to  those  of 
the  FShm.  Click  succeeds  at  providing  a  modular 


network  processing  infrastructure  with  general  purpose 
user-space  processing.  The  key  to  their  success  is  that 
modules  can  be  tested  in  user-space  and  then  recompiled 
as  a  kernel  module,  thereby  avoiding  expensive  commu¬ 
nication  between  kernel-and  user-space.  By  combining 
FShm  and  Click  one  can  gain  the  benefits  of  both. 

6  Conclusion 

The  flexibility  and  low  cost  of  commodity  hardware  sys¬ 
tems  make  them  an  appealing  choice  for  network  pro¬ 
cessing  applications.  However,  as  seen  from  Section  2, 
several  bottlenecks  in  the  software  architecture  make 
processing  frames  at  Gigabit  Ethernet  rates  difficult.  Fur¬ 
thermore  applications  such  as  network  intrusion  detec¬ 
tion  (NID)  must  be  able  to  process  all  frame  sizes  on 
saturated  links.  Prior  to  this  work,  this  has  not  demon¬ 
strated  on  commodity  hardware.  Worse  still,  it  appears 
as  if  prior  work  has  eliminated  almost  all  overhead  for 
frame  processing  on  single  processor  machines. 

To  surmount  the  barrier  to  wire-rate  small  frame  pro¬ 
cessing,  The  FShm  frame  processing  architecture,  pre¬ 
sented  here,  increases  frame  throughput  by  pipelin¬ 
ing  frame  processing  operations  across  multiple  proces¬ 
sors.  To  reduce  the  overhead  of  communicating  between 
pipeline  stages,  FShm  uses  shared  memory  and  concur¬ 
rent  lock-free  queues  to  permit  user-space  applications 
to  process  all  frames  on  a  saturated  Gigabit  Ethernet 
link,  at  all  frame  sizes,  including  64  byte  frames.  Fur¬ 
thermore,  because  FShm  does  not  discriminate  between 
user<-»user,  kerne  I  ^kernel,  and  kernel<-»user  communi¬ 
cation,  it  allows  pipeline  stages  in  both  kernel  and  user- 
space.  Finally,  since  the  number  of  pipeline  stages  is 
limited  only  by  processor  resources,  FShm  can  be  used 
to  split  frame  streams  and  expand  per-frame  processing 
time  with  additional  processors.  This  makes  FShm  ideal 
for  today’s  and  the  future’s  multi-core  systems.  Using 
FShm,  we  demonstrate  that  (1)  FShm  can  indeed  process 
all  frame  sizes  at  Gigabit  Ethernet  rates  with  no  frame 
drops,  (2)  quantify  that  FShm  permits  useful  per-frame 
work,  and  (3)  that  FShm  can  at  least  double  the  per-frame 
processing  time  through  use  of  multiple  processors. 
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8  Availability 

Project  information  and  source  will  be  made  avail¬ 
able  at  http://systems.cs.colorado.edu/ 

mediawiki/ index .php/FShm 
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